{
 "cells": [
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [],
   "source": [
    "%matplotlib inline"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "\n",
    "# Compare sampler combining over- and under-sampling\n",
    "\n",
    "This example shows the effect of applying an under-sampling algorithms after\n",
    "SMOTE over-sampling. In the literature, Tomek's link and edited nearest\n",
    "neighbours are the two methods which have been used and are available in\n",
    "imbalanced-learn.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Authors: Guillaume Lemaitre <g.lemaitre58@gmail.com>\n",
    "# License: MIT"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Automatically created module for IPython interactive environment\n"
     ]
    },
    {
     "ename": "ModuleNotFoundError",
     "evalue": "No module named 'seaborn'",
     "output_type": "error",
     "traceback": [
      "\u001b[1;31m---------------------------------------------------------------------------\u001b[0m",
      "\u001b[1;31mModuleNotFoundError\u001b[0m                       Traceback (most recent call last)",
      "Cell \u001b[1;32mIn [3], line 4\u001b[0m\n\u001b[0;32m      1\u001b[0m \u001b[38;5;28mprint\u001b[39m(\u001b[38;5;18m__doc__\u001b[39m)\n\u001b[0;32m      3\u001b[0m \u001b[38;5;28;01mimport\u001b[39;00m \u001b[38;5;21;01mmatplotlib\u001b[39;00m\u001b[38;5;21;01m.\u001b[39;00m\u001b[38;5;21;01mpyplot\u001b[39;00m \u001b[38;5;28;01mas\u001b[39;00m \u001b[38;5;21;01mplt\u001b[39;00m\n\u001b[1;32m----> 4\u001b[0m \u001b[38;5;28;01mimport\u001b[39;00m \u001b[38;5;21;01mseaborn\u001b[39;00m \u001b[38;5;28;01mas\u001b[39;00m \u001b[38;5;21;01msns\u001b[39;00m\n\u001b[0;32m      6\u001b[0m sns\u001b[38;5;241m.\u001b[39mset_context(\u001b[38;5;124m\"\u001b[39m\u001b[38;5;124mposter\u001b[39m\u001b[38;5;124m\"\u001b[39m)\n",
      "\u001b[1;31mModuleNotFoundError\u001b[0m: No module named 'seaborn'"
     ]
    }
   ],
   "source": [
    "print(__doc__)\n",
    "\n",
    "import matplotlib.pyplot as plt\n",
    "import seaborn as sns\n",
    "\n",
    "sns.set_context(\"poster\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Dataset generation\n",
    "\n",
    "We will create an imbalanced dataset with a couple of samples. We will use\n",
    ":func:`~sklearn.datasets.make_classification` to generate this dataset.\n",
    "\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from sklearn.datasets import make_classification\n",
    "\n",
    "X, y = make_classification(\n",
    "    n_samples=100,\n",
    "    n_features=2,\n",
    "    n_informative=2,\n",
    "    n_redundant=0,\n",
    "    n_repeated=0,\n",
    "    n_classes=3,\n",
    "    n_clusters_per_class=1,\n",
    "    weights=[0.1, 0.2, 0.7],\n",
    "    class_sep=0.8,\n",
    "    random_state=0,\n",
    ")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "_, ax = plt.subplots(figsize=(6, 6))\n",
    "_ = ax.scatter(X[:, 0], X[:, 1], c=y, alpha=0.8, edgecolor=\"k\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The following function will be used to plot the sample space after resampling\n",
    "to illustrate the characteristic of an algorithm.\n",
    "\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from collections import Counter\n",
    "\n",
    "\n",
    "def plot_resampling(X, y, sampler, ax):\n",
    "    \"\"\"Plot the resampled dataset using the sampler.\"\"\"\n",
    "    X_res, y_res = sampler.fit_resample(X, y)\n",
    "    ax.scatter(X_res[:, 0], X_res[:, 1], c=y_res, alpha=0.8, edgecolor=\"k\")\n",
    "    sns.despine(ax=ax, offset=10)\n",
    "    ax.set_title(f\"Decision function for {sampler.__class__.__name__}\")\n",
    "    return Counter(y_res)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The following function will be used to plot the decision function of a\n",
    "classifier given some data.\n",
    "\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "\n",
    "\n",
    "def plot_decision_function(X, y, clf, ax):\n",
    "    \"\"\"Plot the decision function of the classifier and the original data\"\"\"\n",
    "    plot_step = 0.02\n",
    "    x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1\n",
    "    y_min, y_max = X[:, 1].min() - 1, X[:, 1].max() + 1\n",
    "    xx, yy = np.meshgrid(\n",
    "        np.arange(x_min, x_max, plot_step), np.arange(y_min, y_max, plot_step)\n",
    "    )\n",
    "\n",
    "    Z = clf.predict(np.c_[xx.ravel(), yy.ravel()])\n",
    "    Z = Z.reshape(xx.shape)\n",
    "    ax.contourf(xx, yy, Z, alpha=0.4)\n",
    "    ax.scatter(X[:, 0], X[:, 1], alpha=0.8, c=y, edgecolor=\"k\")\n",
    "    ax.set_title(f\"Resampling using {clf[0].__class__.__name__}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    ":class:`~imblearn.over_sampling.SMOTE` allows to generate samples. However,\n",
    "this method of over-sampling does not have any knowledge regarding the\n",
    "underlying distribution. Therefore, some noisy samples can be generated, e.g.\n",
    "when the different classes cannot be well separated. Hence, it can be\n",
    "beneficial to apply an under-sampling algorithm to clean the noisy samples.\n",
    "Two methods are usually used in the literature: (i) Tomek's link and (ii)\n",
    "edited nearest neighbours cleaning methods. Imbalanced-learn provides two\n",
    "ready-to-use samplers :class:`~imblearn.combine.SMOTETomek` and\n",
    ":class:`~imblearn.combine.SMOTEENN`. In general,\n",
    ":class:`~imblearn.combine.SMOTEENN` cleans more noisy data than\n",
    ":class:`~imblearn.combine.SMOTETomek`.\n",
    "\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from imblearn.over_sampling import SMOTE\n",
    "from imblearn.combine import SMOTEENN, SMOTETomek\n",
    "from imblearn.pipeline import make_pipeline\n",
    "from sklearn.svm import LinearSVC\n",
    "\n",
    "samplers = [SMOTE(random_state=0), SMOTEENN(random_state=0), SMOTETomek(random_state=0)]\n",
    "\n",
    "fig, axs = plt.subplots(3, 2, figsize=(15, 25))\n",
    "for ax, sampler in zip(axs, samplers):\n",
    "    clf = make_pipeline(sampler, LinearSVC()).fit(X, y)\n",
    "    plot_decision_function(X, y, clf, ax[0])\n",
    "    plot_resampling(X, y, sampler, ax[1])\n",
    "fig.tight_layout()\n",
    "\n",
    "plt.show()"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.8.13"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 1
}
